RED WINE EDA

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Input variables (based on physicochemical tests):

  1 - fixed acidity (tartaric acid - g / dm^3)
  2 - volatile acidity (acetic acid - g / dm^3)
  3 - citric acid (g / dm^3)
  4 - residual sugar (g / dm^3)
  5 - chlorides (sodium chloride - g / dm^3)
  6 - free sulfur dioxide (mg / dm^3)
  7 - total sulfur dioxide (mg / dm^3)
  8 - density (g / cm^3)
  9 - pH
  10 - sulphates (potassium sulphate - g / dm^3)
  11 - alcohol (% by volume)

Output variable (based on sensory data):

  12 - quality (score between 0 and 10)

Which chemical properties influence the quality of red wines?

Univariate Plots Section

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

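# use a for loop to print a histogram of every variable

library(ggplot2)

plots_uni <- list()

for (nm in names(wines)) {
  # bins = 30 adapts to each variable's range; a fixed binwidth of 0.1 is
  # too coarse for density and too fine for total.sulfur.dioxide
  plots_uni[[nm]] <- ggplot(wines, aes(x = .data[[nm]])) +
    geom_histogram(bins = 30) +
    xlab(nm)
  print(plots_uni[[nm]])
}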

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in the dataset with 11 attributes:

  1 - fixed acidity (tartaric acid - g / dm^3)
  2 - volatile acidity (acetic acid - g / dm^3)
  3 - citric acid (g / dm^3)
  4 - residual sugar (g / dm^3)
  5 - chlorides (sodium chloride - g / dm^3)
  6 - free sulfur dioxide (mg / dm^3)
  7 - total sulfur dioxide (mg / dm^3)
  8 - density (g / cm^3)
  9 - pH
  10 - sulphates (potassium sulphate - g / dm^3)
  11 - alcohol (% by volume)

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature because we are investigating which chemical properties influence the quality of red wines.
Quality is roughly normally distributed: most wines are rated 5 or 6, fewer are rated 4 or 7, and very few are rated 3 or 8.
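
Because quality is an integer score, "normal" can only be approximate, and a count table is a quick way to check the shape. A minimal sketch on made-up scores (the vector `quality_toy` is an illustrative assumption, not data from the wine file):

```r
# Invented quality scores, used only to illustrate the technique.
quality_toy <- c(5, 5, 6, 6, 6, 7, 4, 5, 6, 3, 8, 5)

# Count how many wines received each score.
counts <- table(quality_toy)

# Share of wines per score, rounded to two decimals.
props <- round(prop.table(counts), 2)

print(counts)
print(props)
```

On the real data the same two calls would be applied to `wines$quality`.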

What other features in the dataset do you think will help support your  investigation into your feature(s) of interest?

  • some wines contain no citric acid
  • most wines have between 9% and 12% alcohol
  • the sulfur dioxide distributions are long-tailed
  • free sulfur dioxide appears to be correlated with total sulfur dioxide
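
The last two observations can be checked numerically. A minimal sketch, assuming a small invented data frame (`so2_toy` and its values are hypothetical stand-ins for the two sulfur dioxide columns):

```r
# Invented values standing in for two columns of the wine data.
so2_toy <- data.frame(
  free.sulfur.dioxide  = c(5, 10, 15, 20, 40),
  total.sulfur.dioxide = c(20, 35, 50, 60, 150)
)

# A mean well above the median is a quick sign of a long right tail.
tail_check <- c(
  mean   = mean(so2_toy$total.sulfur.dioxide),
  median = median(so2_toy$total.sulfur.dioxide)
)

# Pearson correlation between the two sulfur dioxide measures.
r_so2 <- cor(so2_toy$free.sulfur.dioxide, so2_toy$total.sulfur.dioxide)

print(tail_check)
print(r_so2)
```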

Did you create any new variables from existing variables in the dataset?

  • Not yet, but I dropped the X variable because it is only a row index (1 to 1,599) and carries no information about Quality
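
Dropping a column by name could look like the sketch below. It runs on a tiny invented data frame (`wines_toy` is a hypothetical name and its values are not rows from the dataset):

```r
# Toy stand-in for the wine data; values are invented for illustration.
wines_toy <- data.frame(
  X       = 1:4,
  alcohol = c(9.4, 9.8, 11.2, 12.5),
  quality = c(5, 5, 6, 7)
)

# Keep every column except the row index X. Single-bracket indexing
# without a comma always returns a data frame.
wines_toy <- wines_toy[setdiff(names(wines_toy), "X")]

print(names(wines_toy))
```

Selecting by name rather than by position keeps the step correct even if the column order changes.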

Of the features you investigated, were there any unusual distributions?  Did you perform any operations on the data to tidy, adjust, or change the form  of the data? If so, why did you do this?

  • No operations yet, apart from dropping the X variable, which is only a row index

Bivariate Plots Section

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

# plot a boxplot of every variable against quality
# using a for loop

plots_bi <- list()

# loop over column names so that each plot is stored under its own name
for (nm in setdiff(names(wines), "quality")) {
  plots_bi[[nm]] <- ggplot(wines, aes(x = quality, y = .data[[nm]], group = quality)) +
    geom_boxplot() +
    ylab(nm)
  print(plots_bi[[nm]])
}

# fixed acidity does not appear to correlate with quality
# volatile acidity appears to negatively correlate with quality
# citric acid appears to positively correlate with quality
# residual sugar needs further investigation because of lots of
# outliers stretching the chart
# chlorides needs further investigation because of lots of
# outliers stretching the chart
# free sulfur dioxide does not appear to correlate with quality
# total sulfur dioxide does not appear to correlate with quality
# density appears to be negatively correlated with quality
# pH appears to be negatively correlated with quality
# sulphates appears to be positively correlated with quality
# alcohol appears to be positively correlated with quality

## 
##  Pearson's product-moment correlation
## 
## data:  wines$volatile.acidity and wines$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Pearson's product-moment correlation
## 
## data:  wines$citric.acid and wines$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## 
##  Pearson's product-moment correlation
## 
## data:  wines$sulphates and wines$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## 
##  Pearson's product-moment correlation
## 
## data:  wines$alcohol and wines$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
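
The fields printed in these reports can also be read from the object that `cor.test()` returns. A minimal sketch on invented values (the vectors `alcohol_toy` and `quality_toy` are hypothetical):

```r
# Invented values, used only to show how to read a cor.test() result.
alcohol_toy <- c(9.4, 9.8, 10.5, 11.2, 12.5, 13.0)
quality_toy <- c(5, 5, 6, 6, 7, 7)

# Pearson's product-moment correlation test (the default method).
ct <- cor.test(alcohol_toy, quality_toy)

r_hat  <- unname(ct$estimate)   # sample correlation ("cor" in the report)
df     <- unname(ct$parameter)  # degrees of freedom, n - 2
ci_95  <- ct$conf.int           # 95 percent confidence interval
p_val  <- ct$p.value

print(c(r = r_hat, df = df, p = p_val))
print(ci_95)
```

With 1,599 wines the degrees of freedom are 1599 - 2 = 1597, which matches the reports above.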

# calculate all correlations at once instead of one at a time,
# using ggcorr() from the GGally package

library(GGally)

ggcorr(wines, label = TRUE, hjust = 0.75, size = 3, color = "grey50", label_size = 3, layout.exp = 1, label_round = 2, label_alpha = TRUE)

# should have done this at the beginning of my analysis

Bivariate Analysis

Talk about some of the relationships you observed in this part of the  investigation. How did the feature(s) of interest vary with other features in  the dataset?

  • Volatile Acidity and Quality (-0.39)
  • Citric Acid and Quality (+0.23)
  • Total Sulfur Dioxide and Quality (-0.19)
  • Sulphates and Quality (+0.25)
  • Alcohol and Quality (+0.48)

Did you observe any interesting relationships between the other features  (not the main feature(s) of interest)?

  • Volatile Acidity and Citric Acid (-0.55)
  • Volatile Acidity and Sulphates (-0.26)
  • Fixed Acidity and Density (+0.67)
  • Fixed Acidity and pH (-0.68)
  • Citric Acid and pH (-0.54)
  • Free Sulfur Dioxide and Total Sulfur Dioxide (+0.67)
  • Density and Alcohol (-0.50)
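
Pairs like these can be pulled out of a correlation matrix instead of being read off a plot. A minimal sketch in base R, assuming an invented data frame and an arbitrary cut-off of |r| >= 0.5 (both are illustrative choices, not taken from the analysis):

```r
# Invented values standing in for three columns of the wine data.
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 8.9, 6.7, 10.1),
  pH            = c(3.51, 3.40, 3.16, 3.30, 3.55, 3.20),
  alcohol       = c(9.4, 11.0, 9.8, 12.1, 10.5, 9.9)
)

r <- cor(toy)

# Blank the diagonal and upper triangle so each pair appears once.
r[upper.tri(r, diag = TRUE)] <- NA

# Reshape the matrix into one row per pair of variables.
pairs <- as.data.frame(as.table(r))
names(pairs) <- c("var1", "var2", "r")

# Keep the pairs at or above the cut-off, strongest first.
pairs <- pairs[!is.na(pairs$r) & abs(pairs$r) >= 0.5, ]
pairs <- pairs[order(-abs(pairs$r)), ]

print(pairs)
```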

What was the strongest relationship you found?

The strongest relationships with Quality were:

  • Alcohol (0.476)
  • Volatile Acidity (-0.390)
  • Sulphates (0.251)
  • Citric Acid (0.226)
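
A ranking like this can be produced by sorting the quality column of the correlation matrix by absolute value. A minimal sketch on invented values (the data frame `toy` is hypothetical):

```r
# Invented values, used only to illustrate the ranking step.
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.5, 13.0),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.45, 0.35, 0.30),
  quality          = c(5, 5, 6, 6, 7, 7)
)

# Correlation of every variable with quality.
r_quality <- cor(toy)[, "quality"]

# Drop quality's correlation with itself, then rank by strength.
r_quality <- r_quality[names(r_quality) != "quality"]
r_ranked  <- r_quality[order(abs(r_quality), decreasing = TRUE)]

print(round(r_ranked, 3))
```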

Multivariate Plots Section

plots_mu <- list()

# wines_sub is a subset of the wines columns that includes quality
# every variable except quality is plotted against every other
vars <- setdiff(names(wines_sub), "quality")

for (x_nm in vars) {
  for (y_nm in vars) {
    # store each plot under its own key instead of overwriting one entry
    key <- paste(x_nm, y_nm, sep = "_vs_")
    plots_mu[[key]] <- ggplot(wines_sub, aes(x = .data[[x_nm]], y = .data[[y_nm]],
                                 color = factor(quality))) +
      geom_point(alpha = 0.5, size = 1, position = 'jitter') +
      scale_color_brewer(type = 'div', palette = 'Spectral',
                         guide = guide_legend(title = 'Quality', reverse = TRUE,
                                              override.aes = list(alpha = 1, size = 2))) +
      xlab(x_nm) +
      ylab(y_nm)
    print(plots_mu[[key]])
  }
}

Multivariate Analysis

Talk about some of the relationships you observed in this part of the  investigation. Were there features that strengthened each other in terms of  looking at your feature(s) of interest?

The combination of high Sulphates and high alcohol seems to result in the highest Quality scores.
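
One way to put a number on this impression is to split both variables into low and high groups and compare the mean Quality in each cell. A minimal sketch, assuming a median split and an invented data frame (both are illustrative choices, not part of the analysis above):

```r
# Invented values, used only to illustrate the two-way comparison.
toy <- data.frame(
  alcohol   = c(9.2, 9.5, 9.8, 11.5, 12.0, 12.4, 9.4, 11.8),
  sulphates = c(0.50, 0.55, 0.80, 0.52, 0.85, 0.90, 0.78, 0.58),
  quality   = c(5, 5, 5, 6, 7, 8, 6, 6)
)

# Median split: values above the median count as "high".
toy$alcohol_level   <- ifelse(toy$alcohol   > median(toy$alcohol),   "high", "low")
toy$sulphates_level <- ifelse(toy$sulphates > median(toy$sulphates), "high", "low")

# Mean quality for each combination of alcohol and sulphates level.
cell_means <- tapply(
  toy$quality,
  list(alcohol = toy$alcohol_level, sulphates = toy$sulphates_level),
  mean
)

print(cell_means)
```

In this toy table the high-alcohol, high-sulphates cell has the highest mean; whether the real data shows the same ordering would need to be checked on `wines`.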

Both citric acid and volatile acidity had correlations with quality of red wine. Volatile acidity negatively correlated with quality. Citric acid positively correlated with quality. Citric acid is also negatively correlated with volatile acidity and positively correlated with fixed acidity.

Were there any interesting or surprising interactions between features?

Residual sugar was not correlated with Quality scores.

OPTIONAL: Did you create any models with your dataset? Discuss the  strengths and limitations of your model.


Final Plots and Summary

Plot One

ggplot(aes(x = quality), data = wines) +
  geom_histogram(binwidth = 1) +
  xlab("Quality") +
  ylab("Number of Red Wines") +
  ggtitle('Red Wine Quality counts')

Description One

The distribution of Red Wine Quality appears roughly normal. A large majority of the red wines in the dataset received a Quality score of 5 or 6. Very few received a 3 or an 8, and no wine scored below 3 or above 8.

Plot Two

ggplot (aes(x = quality, y = alcohol , group=quality), data = wines) +
  geom_boxplot(color = 'blue') +
  geom_point(alpha = 0.1,
             position = position_jitter(h=0),
             color = 'grey50') +
  xlab("Quality") +
  ylab("Alcohol Percentage (%)") +
  ggtitle('Alcohol Percentage (%) by Quality')

Description Two

Red Wines with Quality scores of 3, 4, and 5 had low Alcohol Percentages, with medians around 10%. Above that, the median Alcohol Percentage increased with Quality.
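
The medians behind a boxplot like this can be tabulated directly with `tapply()`. A minimal sketch on invented values (the data frame `toy` is hypothetical):

```r
# Invented values, used only to illustrate a grouped median.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0, 12.5, 13.0),
  quality = c(5,   5,   5,    6,    6,    7,    7,    7)
)

# Median alcohol within each quality score.
med_alcohol <- tapply(toy$alcohol, toy$quality, median)

print(med_alcohol)
```

On the real data the call would be `tapply(wines$alcohol, wines$quality, median)`.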

Plot Three

ggplot(aes(x = alcohol, y = sulphates, color=factor(quality)), data = wines) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div', palette = 'Spectral',
    guide = guide_legend(title = 'Quality', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +
  ggtitle('Quality by Alcohol and Sulphates') +
  xlab("Alcohol Percentage (%)") +
  ylab("potassium sulphate (g/dm3)") +
  scale_x_continuous(limits = c(9, 14)) +
  scale_y_continuous(limits = c(0.3, 1.1)) +
  geom_smooth(method='lm', se=FALSE, size=1)

Description Three

Red Wines with a Quality score of 3 typically had low Sulphates and low Alcohol Percentages. Higher levels of Sulphates and Alcohol were associated with higher Quality scores.


Reflection

The Red Wines data set contains information on 1,599 red wines. I started by exploring the individual variables in the data set using plots. Using several R libraries, I was able to determine which variables were most strongly related to Quality and which variables were correlated with each other.

I learned that citric acid correlated with quality of red wine. Citric acid is also negatively correlated with volatile acidity and positively correlated with fixed acidity. Any models would need to take this into account.

I struggled to work out which type of graph was best suited to this investigation, which led me to create more graphs than necessary, then re-create and delete some of them. I suspect that with more experience I would spend more time planning which graphs I need to draw conclusions, instead of simply starting to plot.
